49 research outputs found

    Mining Feature Relationships in Data

    When faced with a new dataset, most practitioners begin by performing exploratory data analysis to discover interesting patterns and characteristics within data. Techniques such as association rule mining are commonly applied to uncover relationships between features (attributes) of the data. However, association rules are primarily designed for use on binary or categorical data, due to their use of rule-based machine learning. A large proportion of real-world data is continuous in nature, and discretisation of such data leads to inaccurate and less informative association rules. In this paper, we propose an alternative approach called feature relationship mining (FRM), which uses a genetic programming approach to automatically discover symbolic relationships between continuous or categorical features in data. To the best of our knowledge, our proposed approach is the first such symbolic approach with the goal of explicitly discovering relationships between features. Empirical testing on a variety of real-world datasets shows the proposed method is able to find high-quality, simple feature relationships which can be easily interpreted and which provide clear and non-trivial insight into data. Comment: 16 pages, accepted in EuroGP '2

    Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

    Clustering is a difficult and widely-studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g. Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally pre-defined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this paper, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics. Comment: 29 pages, accepted by Evolutionary Computation (Journal), MIT Pres
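
To illustrate the idea of a similarity function that uses only a few features and feeds a graph-based clustering step, here is a minimal sketch. Everything in it is an assumption made for illustration: `similarity` is hand-written rather than evolved, and the `threshold` rule and connected-components grouping are simple stand-ins, not the representation used in the paper.

```python
def similarity(a, b):
    """Hypothetical stand-in for an evolved similarity function.

    Uses only features 0 and 2 (feature selection) and combines them
    arithmetically (feature construction). Higher means more similar.
    """
    d0 = abs(a[0] - b[0])
    d2 = abs(a[2] - b[2])
    return 1.0 / (1.0 + d0 + 0.5 * d2)


def graph_clusters(data, threshold):
    """Link instances whose similarity exceeds `threshold`, then label
    each connected component of the resulting graph as one cluster."""
    n = len(data)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and similarity(data[i], data[j]) > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Feature 1 is deliberately noisy; the similarity function ignores it.
data = [
    (0.0, 9.0, 0.1),
    (0.2, -3.0, 0.0),
    (5.0, 1.0, 5.1),
    (5.1, 7.0, 4.9),
]
labels = graph_clusters(data, threshold=0.5)
```

Because the function reads only two of the three features, the noisy middle column has no effect on the grouping, which is the kind of interpretability benefit the abstract describes.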

    Generating Redundant Features with Unsupervised Multi-Tree Genetic Programming

    Recently, feature selection has become an increasingly important area of research due to the surge in high-dimensional datasets in all areas of modern life. A plethora of feature selection algorithms have been proposed, but it is difficult to truly analyse the quality of a given algorithm. Ideally, an algorithm would be evaluated by measuring how well it removes known bad features. Acquiring datasets with such features is inherently difficult, and so a common technique is to add synthetic bad features to an existing dataset. While adding noisy features is an easy task, it is very difficult to automatically add complex, redundant features. This work proposes one of the first approaches to generating redundant features, using a novel genetic programming approach. Initial experiments show that our proposed method can automatically create difficult, redundant features which have the potential to be used for creating high-quality feature selection benchmark datasets. Comment: 16 pages, preprint for EuroGP '1
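
The difference between a noisy feature and a redundant feature can be shown with a small sketch. The function name `add_synthetic_features` and the fixed expression `x0 * x1 + sin(x2)` are hypothetical; in the paper such expressions are produced by genetic programming, whereas here one is simply written by hand.

```python
import math
import random


def add_synthetic_features(rows, seed=0):
    """Append one redundant and one noisy feature to every row.

    The redundant feature is fully determined by existing features,
    so it adds no new information. The noisy feature is independent
    of the data altogether.
    """
    rng = random.Random(seed)
    out = []
    for x in rows:
        redundant = x[0] * x[1] + math.sin(x[2])  # hand-picked expression (assumption)
        noise = rng.gauss(0.0, 1.0)               # unrelated to any feature
        out.append(tuple(x) + (redundant, noise))
    return out


rows = [(1.0, 2.0, 0.0), (3.0, 0.5, 0.0)]
augmented = add_synthetic_features(rows)
```

A feature selection algorithm evaluated on `augmented` should ideally discard both added columns, but the redundant one is harder to detect because it correlates with the informative features.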

    Differentiable Genetic Programming for High-dimensional Symbolic Regression

    Symbolic regression (SR) is the process of discovering hidden relationships in data as mathematical expressions, and is considered an effective route to interpretable machine learning (ML). Genetic programming (GP) has been the dominant approach to solving SR problems. However, as the scale of SR problems increases, GP often performs poorly and cannot effectively address real-world high-dimensional problems. This limitation is mainly caused by the stochastic evolutionary nature of traditional GP in constructing the trees. In this paper, we propose a differentiable approach named DGP to construct GP trees for high-dimensional SR for the first time. Specifically, a new data structure called the differentiable symbolic tree is proposed to relax the discrete structure into a continuous one, so that a gradient-based optimizer can be used for efficient optimization. In addition, a sampling method is proposed to eliminate the discrepancy caused by this relaxation so that valid symbolic expressions are obtained. Furthermore, a diversification mechanism is introduced to help the optimizer escape local optima and find globally better solutions. With these designs, the proposed DGP method can efficiently search for GP trees with higher performance, and is thus capable of dealing with high-dimensional SR. To demonstrate the effectiveness of DGP, we conducted various experiments against state-of-the-art methods based on both GP and deep neural networks. The experimental results reveal that DGP can outperform these chosen peer competitors on high-dimensional regression benchmarks with dimensions varying from tens to thousands. In addition, on the synthetic SR problems, the proposed DGP method also achieves the best recovery rate, even at different noise levels. It is believed that this work can help SR become a powerful alternative for interpretable ML for a broader range of real-world problems.
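
One common way to relax a discrete operator choice into a continuous one is a softmax-weighted blend over candidate operators. The sketch below shows that idea for a single tree node. It is an assumption about how such a relaxation could look, not the DGP data structure itself: the abstract does not specify the relaxation, and `hard_node` uses a plain argmax as a stand-in for the paper's sampling method.

```python
import math

# Candidate operators for one internal node (illustrative choice).
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_node(logits, a, b):
    """Relaxed node: a softmax-weighted blend of every operator's output.
    The output varies smoothly with `logits`, so gradients can flow."""
    weights = softmax(logits)
    return sum(w * op(a, b) for w, op in zip(weights, OPS.values()))


def hard_node(logits, a, b):
    """Discretised node: keep only the highest-weighted operator,
    yielding a valid symbolic expression."""
    names = list(OPS)
    best = max(range(len(logits)), key=lambda i: logits[i])
    return names[best], OPS[names[best]](a, b)


blended = soft_node([0.0, 0.0, 0.0], 3.0, 2.0)   # equal weights over add, sub, mul
chosen = hard_node([0.1, 0.2, 5.0], 3.0, 2.0)    # "mul" has the largest logit
```

The gap between `soft_node` and `hard_node` outputs for the same logits is the kind of discrepancy that a discretisation step has to account for.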

    Developing a core outcome set for future infertility research : An international consensus development study

    STUDY QUESTION: Can a core outcome set to standardize outcome selection, collection and reporting across future infertility research be developed? SUMMARY ANSWER: A minimum data set, known as a core outcome set, has been developed for randomized controlled trials (RCTs) and systematic reviews evaluating potential treatments for infertility. WHAT IS KNOWN ALREADY: Complex issues, including a failure to consider the perspectives of people with fertility problems when selecting outcomes, variations in outcome definitions and the selective reporting of outcomes on the basis of statistical analysis, make the results of infertility research difficult to interpret. STUDY DESIGN, SIZE, DURATION: A three-round Delphi survey (372 participants from 41 countries) and consensus development workshop (30 participants from 27 countries). PARTICIPANTS/MATERIALS, SETTING, METHODS: Healthcare professionals, researchers and people with fertility problems were brought together in an open and transparent process using formal consensus science methods. MAIN RESULTS AND THE ROLE OF CHANCE: The core outcome set consists of: viable intrauterine pregnancy confirmed by ultrasound (accounting for singleton, twin and higher multiple pregnancy); pregnancy loss (accounting for ectopic pregnancy, miscarriage, stillbirth and termination of pregnancy); live birth; gestational age at delivery; birthweight; neonatal mortality; and major congenital anomaly. Time to pregnancy leading to live birth should be reported when applicable. LIMITATIONS, REASONS FOR CAUTION: We used consensus development methods which have inherent limitations, including the representativeness of the participant sample, Delphi survey attrition and an arbitrary consensus threshold. WIDER IMPLICATIONS OF THE FINDINGS: Embedding the core outcome set within RCTs and systematic reviews should ensure the comprehensive selection, collection and reporting of core outcomes. 
Research funding bodies, the Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT) statement, and over 80 specialty journals, including the Cochrane Gynaecology and Fertility Group, Fertility and Sterility and Human Reproduction, have committed to implementing this core outcome set. STUDY FUNDING/COMPETING INTEREST(S): This research was funded by the Catalyst Fund, Royal Society of New Zealand, Auckland Medical Research Fund and Maurice and Phyllis Paykel Trust. The funder had no role in the design and conduct of the study, the collection, management, analysis or interpretation of data, or manuscript preparation. B.W.J.M. is supported by a National Health and Medical Research Council Practitioner Fellowship (GNT1082548). S.B. was supported by University of Auckland Foundation Seelye Travelling Fellowship. S.B. reports being the Editor-in-Chief of Human Reproduction Open and an editor of the Cochrane Gynaecology and Fertility group. J.L.H.E. reports being the Editor Emeritus of Human Reproduction. J.M.L.K. reports research sponsorship from Ferring and Theramex. R.S.L. reports consultancy fees from Abbvie, Bayer, Ferring, Fractyl, Insud Pharma and Kindex and research sponsorship from Guerbet and Hass Avocado Board. B.W.J.M. reports consultancy fees from Guerbet, iGenomix, Merck, Merck KGaA and ObsEva. C.N. reports being the Co Editor-in-Chief of Fertility and Sterility and Section Editor of the Journal of Urology, research sponsorship from Ferring, and retains a financial interest in NexHand. A.S. reports consultancy fees from Guerbet. E.H.Y.N. reports research sponsorship from Merck. N.L.V. reports consultancy and conference fees from Ferring, Merck and Merck Sharp and Dohme. The remaining authors declare no competing interests in relation to the work presented. All authors have completed the disclosure form

    Dolutegravir twice-daily dosing in children with HIV-associated tuberculosis: a pharmacokinetic and safety study within the open-label, multicentre, randomised, non-inferiority ODYSSEY trial

    Background: Children with HIV-associated tuberculosis (TB) have few antiretroviral therapy (ART) options. We aimed to evaluate the safety and pharmacokinetics of dolutegravir twice-daily dosing in children receiving rifampicin for HIV-associated TB. Methods: We nested a two-period, fixed-order pharmacokinetic substudy within the open-label, multicentre, randomised, controlled, non-inferiority ODYSSEY trial at research centres in South Africa, Uganda, and Zimbabwe. Children (aged 4 weeks to <18 years) with HIV-associated TB who were receiving rifampicin and twice-daily dolutegravir were eligible for inclusion. We did a 12-h pharmacokinetic profile on rifampicin and twice-daily dolutegravir and a 24-h profile on once-daily dolutegravir. Geometric mean ratios for trough plasma concentration (Ctrough), area under the plasma concentration time curve from 0 h to 24 h after dosing (AUC0–24 h), and maximum plasma concentration (Cmax) were used to compare dolutegravir concentrations between substudy days. We assessed rifampicin Cmax on the first substudy day. All children within ODYSSEY with HIV-associated TB who received rifampicin and twice-daily dolutegravir were included in the safety analysis. We described adverse events reported from starting twice-daily dolutegravir to 30 days after returning to once-daily dolutegravir. This trial is registered with ClinicalTrials.gov (NCT02259127), EudraCT (2014–002632-14), and the ISRCTN registry (ISRCTN91737921). Findings: Between Sept 20, 2016, and June 28, 2021, 37 children with HIV-associated TB (median age 11·9 years [range 0·4–17·6], 19 [51%] were female and 18 [49%] were male, 36 [97%] in Africa and one [3%] in Thailand) received rifampicin with twice-daily dolutegravir and were included in the safety analysis. 20 (54%) of 37 children enrolled in the pharmacokinetic substudy, 14 of whom contributed at least one evaluable pharmacokinetic curve for dolutegravir, including 12 who had within-participant comparisons. 
Geometric mean ratios for rifampicin and twice-daily dolutegravir versus once-daily dolutegravir were 1·51 (90% CI 1·08–2·11) for Ctrough, 1·23 (0·99–1·53) for AUC0–24 h, and 0·94 (0·76–1·16) for Cmax. Individual dolutegravir Ctrough concentrations were higher than the 90% effective concentration (ie, 0·32 mg/L) in all children receiving rifampicin and twice-daily dolutegravir. Of 18 children with evaluable rifampicin concentrations, 15 (83%) had a Cmax of less than the optimal target concentration of 8 mg/L. Rifampicin geometric mean Cmax was 5·1 mg/L (coefficient of variation 71%). During a median follow-up of 31 weeks (IQR 30–40), 15 grade 3 or higher adverse events occurred among 11 (30%) of 37 children, ten serious adverse events occurred among eight (22%) children, including two deaths (one tuberculosis-related death, one death due to traumatic injury); no adverse events, including deaths, were considered related to dolutegravir. Interpretation: Twice-daily dolutegravir was shown to be safe and sufficient to overcome the rifampicin enzyme-inducing effect in children, and could provide a practical ART option for children with HIV-associated TB

    Neuropsychiatric manifestations and sleep disturbances with dolutegravir-based antiretroviral therapy versus standard of care in children and adolescents: a secondary analysis of the ODYSSEY trial

    BACKGROUND: Cohort studies in adults with HIV showed that dolutegravir was associated with neuropsychiatric adverse events and sleep problems, yet data are scarce in children and adolescents. We aimed to evaluate neuropsychiatric manifestations in children and adolescents treated with dolutegravir-based treatment versus alternative antiretroviral therapy. METHODS: This is a secondary analysis of ODYSSEY, an open-label, multicentre, randomised, non-inferiority trial, in which adolescents and children initiating first-line or second-line antiretroviral therapy were randomly assigned 1:1 to dolutegravir-based treatment or standard-of-care treatment. We assessed neuropsychiatric adverse events (reported by clinicians) and responses to the mood and sleep questionnaires (reported by the participant or their carer) in both groups. We compared the proportions of patients with neuropsychiatric adverse events (neurological, psychiatric, and total), time to first neuropsychiatric adverse event, and participant-reported responses to questionnaires capturing issues with mood, suicidal thoughts, and sleep problems. FINDINGS: Between Sept 20, 2016, and June 22, 2018, 707 participants were enrolled, of whom 345 (49%) were female and 362 (51%) were male, and 623 (88%) were Black-African. Of 707 participants, 350 (50%) were randomly assigned to dolutegravir-based antiretroviral therapy and 357 (50%) to non-dolutegravir-based standard-of-care. 311 (44%) of 707 participants started first-line antiretroviral therapy (ODYSSEY-A; 145 [92%] of 157 participants had efavirenz-based therapy in the standard-of-care group), and 396 (56%) of 707 started second-line therapy (ODYSSEY-B; 195 [98%] of 200 had protease inhibitor-based therapy in the standard-of-care group). 
During follow-up (median 142 weeks, IQR 124–159), 23 participants had 31 neuropsychiatric adverse events (15 in the dolutegravir group and eight in the standard-of-care group; difference in proportion of participants with ≥1 event p=0·13). 11 participants had one or more neurological events (six and five; p=0·74) and 14 participants had one or more psychiatric events (ten and four; p=0·097). Among 14 participants with psychiatric events, eight participants in the dolutegravir group and four in the standard-of-care group had suicidal ideation or behaviour. More participants in the dolutegravir group than the standard-of-care group reported symptoms of self-harm (eight vs one; p=0·025), life not worth living (17 vs five; p=0·0091), or suicidal thoughts (13 vs none; p=0·0006) at one or more follow-up visits. Most reports were transient. There were no differences by treatment group in low mood or feeling sad, problems concentrating, feeling worried or feeling angry or aggressive, sleep problems, or sleep quality. INTERPRETATION: The numbers of neuropsychiatric adverse events and reported neuropsychiatric symptoms were low. However, numerically more participants had psychiatric events and reported suicidal ideation in the dolutegravir group than the standard-of-care group. These differences should be interpreted with caution in an open-label trial. Clinicians and policy makers should consider including suicidality screening of children or adolescents receiving dolutegravir.

    Evolutionary Feature Manipulation in Unsupervised Learning

    Unsupervised learning is a fundamental category of machine learning that works on data for which no pre-existing labels are available. Unlike in supervised learning, which has such labels, methods that perform unsupervised learning must discover intrinsic patterns within data. The size and complexity of data have increased substantially in recent years, which has necessitated the creation of new techniques for reducing the complexity and dimensionality of data in order to allow humans to understand the knowledge contained within data. This is particularly problematic in unsupervised learning, as the number of possible patterns in a dataset grows exponentially with regard to the number of dimensions. Feature manipulation techniques such as feature selection (FS) and feature construction (FC) are often used in these situations. FS automatically selects the most valuable features (attributes) in a dataset, whereas FC constructs new, more powerful and meaningful features that provide a lower-dimensional space. Evolutionary computation (EC) approaches have become increasingly recognised for their potential to provide high-quality solutions to data mining problems in a reasonable amount of computational time. Unlike other popular techniques such as neural networks, EC methods have global search ability without needing gradient information, which makes them much more flexible and applicable to a wider range of problems. EC approaches have shown significant potential in feature manipulation tasks with methods such as Particle Swarm Optimisation (PSO) commonly used for FS, and Genetic Programming (GP) for FC. The use of EC for feature manipulation has, until now, been predominantly restricted to supervised learning problems. This is a notable gap in the research: if unsupervised learning is even more sensitive to high dimensionality, then why is EC-based feature manipulation not used for unsupervised learning problems? 
This thesis provides the first comprehensive investigation into the use of evolutionary feature manipulation for unsupervised learning tasks. It clearly shows the ability of evolutionary feature manipulation to improve both the performance of algorithms and interpretability of solutions in unsupervised learning tasks. A variety of tasks are investigated, including the well-established task of clustering, as well as more recent unsupervised learning problems, such as benchmark dataset creation and manifold learning. This thesis proposes a new PSO-based approach to performing simultaneous FS and clustering. A number of improvements to the state of the art are made, including the introduction of a new medoid-based representation and an improved fitness function. A sophisticated three-stage algorithm, which takes advantage of heuristic techniques to determine the number of clusters and to fine-tune clustering performance, is also developed. Empirical evaluation on a range of clustering problems demonstrates a decrease in the number of features used, while also improving the clustering performance. This thesis also introduces two innovative approaches to performing wrapper-based FC in clustering tasks using GP. An initial approach where constructed features are directly provided to the k-means clustering algorithm demonstrates the clear strength of GP-based FC for improving clustering results. A more advanced method is proposed that utilises the functional nature of GP-based FC to evolve more specific, concise, and understandable similarity functions for use in clustering algorithms. These similarity functions provide clear improvements in performance and can be easily interpreted by machine learning practitioners. This thesis demonstrates the ability of evolutionary feature manipulation to solve unsupervised learning tasks that traditional methods have struggled with. 
The synthesis of benchmark datasets has long been a technique used for evaluating machine learning techniques, but this research is the first to present an approach that automatically creates diverse and challenging redundant features for a given dataset. This thesis introduces a GP-based FC approach that creates difficult benchmark datasets for evaluating FS algorithms. It also makes the intriguing discovery that using a mutual information-based fitness function with GP has the potential to be used to improve supervised learning tasks even when the labels are not utilised. Manifold learning is an approach to dimensionality reduction that aims to reduce dimensionality by discovering the inherent lower-dimensional structure of a dataset. While state-of-the-art manifold learning approaches show impressive performance in reducing data dimensionality, they do so at the cost of removing the ability for humans to understand the data in terms of the original features. By utilising a GP-based approach, this thesis proposes new methods that can perform interpretable manifold learning, which provides deep insight into patterns in the data. These four contributions clearly support the hypothesis that evolutionary feature manipulation has untapped potential in unsupervised learning. This thesis demonstrates that EC-based feature manipulation can be successfully applied to a variety of unsupervised learning tasks with clear improvements in both performance and interpretability. A plethora of future research directions are also identified, which we hope will lead to further valuable findings in this area.

    Evolving Simpler Constructed Features for Clustering Problems with Genetic Programming

    Clustering is a widely used unsupervised learning technique. However, as the size and complexity of data increase, the performance of clustering algorithms diminishes, as does the interpretability of the clustering partition. Genetic programming has been used to perform feature construction on data to increase clustering performance. However, existing work has not focused on encouraging simpler constructed features. In this paper, existing techniques are further developed to include parsimony pressure, a method to encourage evolution towards simpler solutions. With simpler solutions, the constructed features become easier to understand and interpret. The results of experiments using the proposed method show that parsimony pressure is an effective method for producing significantly simpler constructed features without any reduction in the performance of k-means++ clustering. Evolved individuals are also analysed to demonstrate the effect of parsimony pressure on interpretability, showing the power of parsimony pressure for avoiding redundancies in individuals, and thus increasing interpretability.
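
Parsimony pressure is often implemented as a size penalty subtracted from the raw fitness. The sketch below shows that linear form. The penalty weight `alpha`, the nested-tuple tree encoding, and the function names are all assumptions for illustration; the abstract does not state which form of parsimony pressure the paper uses.

```python
def tree_size(tree):
    """Count the nodes of an expression stored as nested tuples,
    e.g. ('add', 'x0', ('mul', 'x1', 'x2')) has 5 nodes."""
    if isinstance(tree, tuple):
        return 1 + sum(tree_size(child) for child in tree[1:])
    return 1  # a leaf: a feature name or a constant


def penalised_fitness(performance, tree, alpha=0.01):
    """Linear parsimony pressure: clustering performance (higher is
    better) minus a penalty proportional to tree size."""
    return performance - alpha * tree_size(tree)


small = ("add", "x0", "x1")
large = ("add", ("mul", "x0", "x0"), ("sub", "x1", ("mul", "x2", "x2")))

# With equal clustering performance, the simpler feature scores higher.
f_small = penalised_fitness(0.90, small)
f_large = penalised_fitness(0.90, large)
```

Under this scheme a larger tree must earn its extra nodes through better clustering performance, which pushes evolution towards constructed features that are easier to read.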